System and method for determining the start of a match of a regular expression

ABSTRACT

A system for determining the start of a match of a regular expression includes a special state table that contains start entries and terminal entries, and a set of start state registers for holding offset information. The system further includes a DFA next state table that, given the current state and an input character, returns the next state. A settable indicator is included in the DFA next state table corresponding to each next state table entry which indicates whether to perform a lookup in the special state table. A compiler loads values into the special state table based on the regular expression. A method for determining the start of a match of a regular expression using the special state table, the set of start state registers and the DFA next state table, includes the step of determining from the regular expression each start-of-match start state and each end-of-match terminal state. For each start state, a start state entry is loaded into the special state table. For each terminal state, a terminal state entry is loaded into the special state table. The next state table is used to return the next state from the current state and an input character. When a start state is encountered, the current offset from the beginning of the input character string is loaded into the start state register. When a terminal state is encountered, the terminal state entry is retrieved from the special state table, and the value of the start state register corresponding to the rule number of the terminal entry in the special state table is further retrieved. The value of the start state register which is retrieved indicates the location in the character string where the start-of-match occurred for a particular rule.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to co-pending U.S. provisional patent application Serial No. 60/445,620, filed Feb. 7, 2003, and entitled “System and Method for Determining the Start of a Match of a Regular Expression”, the disclosure of which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention generally relates to pattern recognition of character strings using regular expressions, and more particularly relates to methods and engines for searching character strings for patterns and determining the start of a matching pattern.

[0004] 2. Description of the Prior Art

[0005] Regular expressions are formulas used for matching character strings that follow some pattern. They are made up of normal characters, such as upper and lower case letters and numbers, and “metacharacters”, which are symbols, such as /*|[ ], or the like, that have special meanings. Regular expressions are well known in the art, and for a more complete explanation of what they are and how they are used in pattern matching, reference should be made to Mastering Regular Expressions, by Jeffrey E. F. Friedl, published by O'Reilly and Associates, Inc., the disclosure of which is incorporated herein by reference.

[0006] Two different regular expression (“regex”) engines commonly used for searching for patterns in a character string are a non-deterministic finite state automaton (NFA) and a deterministic finite state automaton (DFA). Again, reference should be made to the aforementioned publication, Mastering Regular Expressions, for a more complete explanation of how an NFA and a DFA function.

[0007] FIG. 1 illustrates one conventional pattern matching scheme using either an NFA or a DFA. In this example, the pattern to be matched is expressed as the regex (a*|b)x. The character string being sampled is eight characters long, for this particular illustrative example.

[0008] In the example shown in FIG. 1, the first step (Step 1) in this conventional method of pattern matching is where the pattern is anchored at the first character in the string, which is character no. 0 and which is the character “a”. The matcher (i.e., the NFA or DFA) consumes characters until it reaches a failure state, which for the first step (Step 1) in the method occurs at character no. 6 in the string (which is the lower case letter “b”). In the example, it should be noted that “m” represents a successful match, “f” represents that the match has failed, and “M” represents that the match is successful.

[0009] In the second step (Step 2) of this method of pattern matching, the pattern is now anchored at the second character in the string (i.e., character no. 1), which is also the lower case letter “a”. The pattern begins matching at character no. 1 and, again, fails at character no. 6 (i.e., the seventh character in the string), which is the lower case letter “b”. Thus, it should be noted that the pattern matcher (i.e., the NFA or DFA), in Step 2, has now gone over six characters that have already been considered in Step 1 of the pattern matching method. Thus, for a character string of eight characters, and for the given pattern of /(a*|b)x/, expressed as a regex, 29 characters must be considered before a match is found. As shown in FIG. 1, the match occurs in Step 7, where the pattern is anchored at character no. 6.

[0010] The advantage of this scheme is that the start and the end of the match are known. The disadvantage is that, in the worst case situation, n² characters must be considered, where n is the length of the input string. Thus, if m patterns are to be considered simultaneously using this conventional method, and a separate pass is made on the input string for each pattern, the total number of comparisons performed is m×n².
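The character count in this conventional scheme can be reproduced with a short sketch. FIG. 1 is not reproduced here, so the input string "aaaaaabx" is an assumption chosen to agree with the details given above (eight characters, “a” at characters no. 0 and no. 1, “b” at character no. 6, match found when anchored at character no. 6). The hand-built transition table and the function names are likewise illustrative only.

```python
# Hand-built DFA for the anchored pattern (a*|b)x.
# State names are illustrative; a missing entry means the match fails.
TRANSITIONS = {
    ("start", "a"): "as",
    ("start", "b"): "b",
    ("start", "x"): "accept",
    ("as", "a"): "as",
    ("as", "x"): "accept",
    ("b", "x"): "accept",
}


def scan_from(text, anchor):
    """Run one anchored attempt; return (matched, characters considered)."""
    state, considered = "start", 0
    for ch in text[anchor:]:
        considered += 1
        state = TRANSITIONS.get((state, ch))
        if state is None:          # failure state reached
            return False, considered
        if state == "accept":      # terminal state reached
            return True, considered
    return False, considered


def naive_search(text):
    """Re-anchor at each character in turn, counting every character considered."""
    total = 0
    for anchor in range(len(text)):
        matched, considered = scan_from(text, anchor)
        total += considered
        if matched:
            return anchor, anchor + considered - 1, total
    return None, None, total


# Assumed input string, consistent with the description of FIG. 1.
start, end, total = naive_search("aaaaaabx")
print(start, end, total)  # 6 7 29
```

Under these assumptions the attempts anchored at characters no. 0 through no. 5 consider 7, 6, 5, 4, 3 and 2 characters before failing, and the attempt anchored at character no. 6 considers 2 characters, for a total of 29.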

[0011] Another method of pattern matching using regular expressions is described below. If, for example, there were two patterns, one of which is expressed by the regex /(a*|b)x/, as in the example given above and shown in FIG. 1, and the other pattern is the regex /pqr/, the two patterns may be combined into the following pattern: /.*(a*|b)x|.*pqr/

[0012] This particular pattern will succeed only if either of the original patterns succeeds (i.e., is matched), and the end of the match for this combined pattern will occur in the same place as if the original patterns were searched individually. What is more, the pattern matcher will find the match after considering at most n characters, since the pattern is anchored at the first character and will run from there.

[0013] The problem, however, with this second pattern matching scheme is that it is unclear where the start of match occurs. (The end of the match is known, as the matcher knows the character number when a terminal or accepting state is reached.)
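The loss of start-of-match information can be illustrated with Python's standard `re` module. This is only an illustration of the symptom: `re` is a backtracking engine rather than the DFA described here, and the input string "aaaaaabx" is an assumption chosen to agree with the FIG. 1 example.

```python
import re

# Combined pattern from the text, anchored at the first character by re.match.
combined = re.compile(r".*(a*|b)x|.*pqr")

match = combined.match("aaaaaabx")  # assumed input string

# The reported start is the anchor (0), not character no. 6 where the
# original pattern (a*|b)x actually begins to match.
print(match.start(), match.end())  # 0 8
```

Note that `match.end()` is exclusive, so the value 8 corresponds to a last matched character no. 7. The end of the match is therefore recoverable, while the reported start is simply the anchor position.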

OBJECTS AND SUMMARY OF THE INVENTION

[0014] It is an object of the present invention to provide a method for matching a pattern in a character string.

[0015] It is another object of the present invention to provide a method of pattern matching which determines the start of a match of a pattern expressed as a regular expression.

[0016] It is still another object of the present invention to provide a system for matching a pattern in a character string and for determining the start of the match.

[0017] It is a further object of the present invention to provide a hardware engine that supports the pattern matching method of the present invention.

[0018] It is still a further object of the present invention to provide a regular expression to DFA compiler that produces transition and other tables for the hardware engine.

[0019] It is yet a further object of the present invention to provide a system and method for determining the start of a match of a regular expression which overcomes the disadvantages inherent with conventional systems and pattern matching methods.

[0020] In one form of the present invention, a system for determining the start of a match of a regular expression includes a special state table that contains start entries and terminal entries, and a set of start state registers for holding offset information. The system further includes a DFA next state table that, given the current state and an input character, returns the next state. A settable indicator is included in the DFA next state table corresponding to each next state table entry which indicates whether to perform a lookup in the special state table. A compiler loads values into the special state table based on the regular expression.

[0021] A method in accordance with one form of the present invention for determining the start of a match of a regular expression using the special state table, the set of start state registers and the DFA next state table, includes the step of determining from the regular expression each start-of-match start state and each end-of-match terminal state. For each start state, a start state entry is loaded into the special state table. For each terminal state, a terminal state entry is loaded into the special state table. The next state table is used to return the next state from the current state and an input character. When a start state is encountered, the current offset from the beginning of the input character string is loaded into the start state register. When a terminal state is encountered, the terminal state entry is retrieved from the special state table, and the value of the start state register corresponding to the rule number of the terminal entry in the special state table is further retrieved. The value of the start state register which is retrieved indicates the location in the character string where the start-of-match occurred for a particular rule.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] FIG. 1 is an illustrative block diagram representation of a conventional method for matching a regular expression.

[0023] FIG. 2 is a block diagram which illustrates in accordance with one form of the present invention the hardware used to carry out the method of determining the start of a match of a regular expression.

[0024] FIG. 3 is a state transition diagram, in block diagram form, of an illustrative example of how the system and method of the present invention operate.

[0025] FIG. 4 schematically represents, in block diagram form, the operation of the system and method of the present invention in determining the start of a match of each rule of the DFA illustrated by the state transition diagram shown in FIG. 3.

[0026] FIG. 5 is a partial state transition diagram illustrating one step in the method for producing a final multi-rule DFA.

[0027] FIG. 6 is a state transition diagram of an illustrative example, showing how a compiler formed in accordance with the present invention determines the start-of-match states for a particular regular expression.

[0028] FIG. 7 is a block diagram of a system for matching a regular expression formed in accordance with one form of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0029] It was previously described with respect to the second conventional method of pattern matching that, if m patterns are combined into a single DFA, the end of each match for each pattern can be determined in a single pass, i.e., after considering at most n characters, where n is the length of the character input string. If the DFA is implemented in hardware, the matches can be performed at high rates, e.g., in the gigabit range and higher. The system and method of the present invention incorporate these advantages, and further have the ability to find the start-of-match location for each of r patterns, where r is less than or equal to m and is limited only by the practical hardware constraints of the system. The methodology of the present invention will be described herein, as well as a hardware engine that allows the implementation of the present method for determining the start-of-match of a regular expression, and a modified regular expression to DFA compiler that produces transition tables and the like for the hardware engine.

[0030] Turning initially to FIG. 2 of the drawings, a preferred embodiment of the hardware engine formed in accordance with the present invention is schematically depicted. The hardware engine first includes a DFA next state table 2, also commonly referred to as a state transition table. The DFA next state table 2 is similar in many respects to a conventional transition table in that, given the current state of the DFA and an input character from a character string, it returns the next state. However, in accordance with the present invention, the DFA next state table 2 includes a special bit for each next state entry.

[0031] More specifically, the DFA next state table, as shown in FIG. 2, includes a plurality of columns and a plurality of rows. Each column is headed by a character (0, 1, 2, . . . 255), which represents, for example, each of the alphanumeric characters and other symbols one would find on a computer keyboard and possibly elsewhere. The input characters may be represented by a seven or eight bit ASCII (American Standard Code for Information Interchange) code. For example, character no. 97 may represent the lower case letter “a”, and character no. 98 could represent the lower case letter “b”. Thus, input character “a” and input character “b” would each head up one column in the DFA next state table. As in a conventional transition table, the rows are designated by the current states of the DFA. The intersection of a current state row and a current input character column defines the next state of the DFA, which may be represented by a binary code. The DFA next state table 2 may be stored in a memory, such as a read-only memory (ROM) or a random access memory (RAM), or in another memory of the computer or other device which is used as the pattern matcher. The memory is generally referred to hereinafter as the automaton memory 3, as it is operatively associated with the finite state automaton.

[0032] In accordance with the present invention, the DFA next state table 2 further includes a special bit appended to each next state in the table. The special bit, which may be a binary bit, such as a binary “0” or a “1”, signifies that the next state in the transition table is a “special” state, in that it is either a start state, a terminal state, or both. When the special bit is set, such as by having a binary “1” in the special bit position, a lookup is performed in a special state table 4, which forms part of the present invention. As shown in FIG. 2, this special state table 4 includes at least one of two types of entries. The first is a start entry 6 and the second is a terminal entry 8. It should be realized that the special state table may include a start entry and a terminal entry corresponding to the same state in the special state table.

[0033] The special state table 4 is an array of one or more dimensions containing information about each state which is considered a “special” state. In other words, in the DFA next state table 2, if the next state is determined to be a “special” state, denoted by the special bit being set, a lookup is performed in the special state table 4 for information concerning that designated special state. The information is preferably a 16 bit word for each special state, although it may be longer or shorter, as required.

[0034] If the special state is a start state, then the 16 bit word contained in the special state table 4 for that particular state has start entry information. If the special state is a terminal state, then the 16 bit word for that particular special state includes terminal entry information.

[0035] As shown in FIG. 2, the start entry information includes, preferably, a four bit opcode, which indicates whether the state is a start state or a terminal state. Of course, it should be realized that a greater or lesser number of bits than four may be included as signifying the opcode. Only one bit is actually required, but the opcode may serve other purposes.

[0036] The start entry information further preferably includes 12 additional bits which define a “start state register select” code. Each bit of the start state register select code will either be a binary “1” or a binary “0”, and will designate a particular rule number or pattern (i.e., regular expression) that is to be matched. In the example shown in FIG. 2, there are 12 start state register select bits and, accordingly, there are 12 possible patterns that may be matched in this particular DFA. However, as mentioned previously, the start entry 6 may be longer or shorter than 16 bits and, correspondingly, the DFA may include more or fewer than 12 patterns that are being matched.

[0037] If the special state is a terminal state, then the preferred 16 bit word stored in the special state table 4 for that particular state will have terminal entry information, as shown in FIG. 2. The preferred 16 bit terminal entry 8 includes a four bit opcode, which indicates whether that special state is a start state or a terminal state, or both. The remaining 12 bits of the terminal entry 8 designate the particular rule number of the pattern to which that terminal state relates, and the start state register number, which also would correspond to the start state register select code of the start entry information.
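The two 16 bit entry layouts can be sketched as follows. The field widths (a four bit opcode, a 12 bit start state register select code, and a six bit rule number followed by a six bit start state register number) and the example opcode values “0001” and “0010” are taken from the worked example in this description. The bit ordering of the select code (Rule 1 in the second bit from the left) is inferred from the example codes “010000000000” and “001000000000”, and the function names are hypothetical.

```python
OPCODE_START = 0b0001     # example start opcode used in the description
OPCODE_TERMINAL = 0b0010  # one of the example terminal opcodes


def make_start_entry(rules):
    """Pack a start entry: 4 bit opcode, then a 12 bit register select code."""
    select = 0
    for rule in rules:
        if not 0 <= rule < 12:
            raise ValueError("rule number out of range for a 12 bit select code")
        select |= 1 << (11 - rule)  # assumed ordering: rule k is bit k from the left
    return (OPCODE_START << 12) | select


def make_terminal_entry(rule, register):
    """Pack a terminal entry: 4 bit opcode, 6 bit rule number, 6 bit register number."""
    if not (0 <= rule < 64 and 0 <= register < 64):
        raise ValueError("field does not fit in six bits")
    return (OPCODE_TERMINAL << 12) | (rule << 6) | register


def decode_entry(word):
    """Unpack a 16 bit special state table word into a readable tuple."""
    opcode, payload = word >> 12, word & 0xFFF
    if opcode == OPCODE_START:
        rules = [r for r in range(12) if payload & (1 << (11 - r))]
        return ("start", rules)
    if opcode == OPCODE_TERMINAL:
        return ("terminal", payload >> 6, payload & 0x3F)
    raise ValueError("unknown opcode")


print(format(make_start_entry([1]), "016b"))        # 0001010000000000
print(format(make_terminal_entry(2, 2), "016b"))    # 0010000010000010
```

A state that is both a start state and a terminal state would need either two words or a combined opcode; the description leaves that choice open, so it is not modeled here.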

[0038] The special state table 4 is, essentially, a lookup table containing binary information. Accordingly, like the DFA next state table 2, it too may be stored in a ROM or RAM, or more generally, in the automaton memory.

[0039] As shown in FIG. 2, the hardware engine of the present invention would further include a plurality of start state registers 10, shown as a column in FIG. 2. Each register corresponds to a particular rule number or pattern being matched by the DFA. There are 12 registers which are shown by way of example in FIG. 2 for the start state registers 10. However, it should be understood that there may be more or fewer registers than those shown in FIG. 2, depending upon the number of patterns or rules being matched, preferably simultaneously, by the DFA.

[0040] In each start state register 10, there is corresponding information, in the form of a binary word, stored therein which denotes the offset from the beginning of the character string being tested, in other words, the distance, in characters, from the beginning of the input character string to the current character which caused the DFA to transition to a start state corresponding to that particular rule number or pattern. This offset information, accordingly, signifies and defines where in the character string the start of a match for that particular rule or pattern is located. It should be noted that the end-of-match is already known, as it is the location of the current character in the character string which caused the DFA to transition to a terminal state, and this location is known by the pattern matcher. Thus, in accordance with the present invention, the start and end for each regular expression, or pattern, may be determined.

[0041] An example illustrating how the method and hardware engine used for determining the start of a match of a regular expression operates is shown in FIGS. 3 and 4. More specifically, FIG. 3 shows the DFA for the pair of rules (i.e., regular expressions, or patterns) /.*ab.*cd/ and /.*xy/, which are referred to herein as Rule 1 and Rule 2, respectively. For each terminal state, which can be seen from the DFA in FIG. 3 as being states 1, 3 and 8, a terminal state entry 8 is added to the special state table, and the corresponding special bits for those states (as next states) are set in the DFA next state table 2. As can be seen from the state transition diagram of FIG. 3, state 1 is a terminal state for Rule 2, state 3 is a terminal state for Rule 1 and state 8 is also a terminal state for Rule 2. Thus, the terminal entry 8 in the special state table 4 for state 1 would designate Rule 2 as the particular rule number of the pattern to which that terminal state relates, and would further designate the start state register number as “2”. Similarly, the terminal entry 8 for state 3 would designate Rule 1 as the rule number and “1” as the start state register number, and the terminal entry 8 for state 8 would designate Rule 2 as the rule number and “2” as the start state register number.

[0042] As can further be seen from the DFA of FIG. 3, there are also three start states, that is, states 7, 5 and 2. The corresponding special bits for each of states 7, 5 and 2 (as next states) are set in the DFA next state table 2. Furthermore, for state 7, the corresponding start entry has its start state register select code with a particularly designated bit for Rule 1 on, since this would be the reported start state if Rule 1 matched at state 3, which is a terminal state for Rule 1. For state 5, the corresponding start entry has its start state register select code with a particularly designated bit for Rule 2 set, and for state 2, the corresponding start entry would also have the Rule 2 bit set in its start state register select code. It should be noted that, because of the metacharacter “.*” construct between the two patterns (i.e., Rule 1 and Rule 2) in the regular expression, the start state for Rule 2 at state 8 actually occurs at state 2, which is quite far from the global start state, i.e., state 0.

[0043] For the DFA of FIG. 3, the DFA next state table 2 shown in FIG. 4 has been selectively completed with the more pertinent information to facilitate an understanding of the invention. It should be noted that in the DFA next state table, the special bits are set with a binary “1” for each next state which is designated as a special state. This includes the start states (i.e., states 7, 5 and 2), and the terminal states (i.e., states 1, 3 and 8). The special bits for all other next states which are not considered special states are not set, as reflected by the binary “0” for each corresponding special bit.

[0044] Assume that the first character in the character string inputted to the DFA is a lower case “a”, which is no. 97 in ASCII code, or in binary would be 01100001. It should be remembered that, because of the metacharacter “.*” construct of the two regular expressions (i.e., Rule 1 and Rule 2), zero or more characters may precede either rule in the character string. However, to simplify the explanation of the invention, it will be assumed that a lower case “a” is the first character in the input character string.

[0045] In accordance with the DFA next state table 2, and as clearly shown in the state transition diagram for the DFA in FIG. 3, for the row headed by current state 0 and the column headed by no. 97, corresponding to the current input character “a”, the next state when an “a” is received would be state 7. Since state 7 is a start state for the regular expression /.*ab.*cd/ (Rule 1), the special bit will be set to a binary “1” in the DFA next state table next to the entry for state 7.

[0046] This special bit, being set to a binary “1”, indicates that that particular next state (state 7) is a special state. In accordance with the method of the present invention, a lookup is performed in the special state table 4. As shown in FIG. 4, the special state table for state 7 includes a start entry 6, since state 7 is a start state. The start entry 6 would have a four bit opcode, such as “0001”, indicating that state 7 is a start state. Furthermore, the start entry would have bits 5-16 as being “010000000000” as the start state register select. This code would indicate that state 7 is a start state for Rule 1, since the second bit in from the beginning of the start state register select code would be on (e.g., a binary “1”) in the bit slot for Rule 1.

[0047] The hardware engine would then go to the start state registers 10, and for the register corresponding to Rule 1, the current offset from the beginning of the input character string would be entered in that register. In this case, since a lower case “a” was received as the first character in the string, the start state register for Rule 1 would have a binary “000” entered into it, which would indicate that the start of a match for Rule 1 (i.e., the first regular expression or pattern described previously) occurred on the first character in the character string, with 0 offset.

[0048] Now, assume that the next character in the input character string is a lower case “b”. As can be seen from the transition diagram of FIG. 3, a lower case “b” as an input character would cause the DFA to go from state 7 to state 4. It should be noted that state 4 is neither a start state nor a terminal state.

[0049] Turning now to the partially completed DFA next state table 2 shown in FIG. 4, for this particular example, for the row headed by current state 7 and the column headed by current character no. 98 (a lower case “b” is number 98 in an ASCII code, or in binary, 01100010), the next state at the intersection of that particular row and column is designated as state 4. Since, as mentioned previously, state 4 is not a special state in that it is neither a start state nor a terminal state, the special bit corresponding to state 4 is not set and is designated by a binary “0”. There would be no entry in the special state table for state 4, as it is not a special state, and no lookup is performed in the special state table 4, since the special bit corresponding to state 4 in the next state table is not set (it is a binary “0”).

[0050] Next, assume that a lower case “x” is the next character in the input character string. According to the transition diagram of FIG. 3, a lower case “x” as the next character would cause a transition from state 4 to state 2. State 2 is a start state for Rule 2, that is, the regular expression /.*xy/. In the DFA next state table 2 shown in FIG. 4, for the row headed by current state 4 and the column headed by current character no. 120 (a lower case “x” is no. 120 in ASCII code, or in binary, it would be 01111000), the table would yield a next state as state 2. Since state 2 is a start state for Rule 2, a special bit will be set in the DFA next state table 2 adjacent to the next state entry (state 2), such as by having the special bit as a binary “1”. Since the special bit is set, indicating that state 2 is a special state, a lookup is performed in the special state table 4 for state 2.

[0051] Since state 2 is a start state, a start entry 6 would be found in the special state table 4 corresponding to state 2. The start entry 6 would have an opcode indicating that state 2 is a start state, such as by the binary code 0001. The start entry would further have a 12 bit start state register select code following the opcode in which the Rule 2 bit slot would be set with a binary “1”, so that the start state register select 12 bit code would appear as “001000000000”. Thus, the start entry for state 2 would indicate that state 2 is a start state for Rule 2, i.e., the second regular expression or pattern described previously.

[0052] The start state registers 10 are then accessed and, as illustrated by FIG. 4, the current offset for the register corresponding to Rule 2 for when the lower case “x” appeared in the input character string, from the start of the string, would be entered. In this particular example, the lower case “x” was received two characters after the beginning of the input character string. Accordingly, a binary “010” would be entered into the start state register for Rule 2.

[0053] To complete the example, assume that the next character in the input character string is a lower case “y”. As can be seen from the transition diagram of FIG. 3, a lower case “y” as the next character would cause a transition from state 2 to state 8. State 8 is a terminal state for Rule 2, which is the regular expression /.*xy/. Turning now to the DFA next state table 2 shown in FIG. 4, for the row headed by current state 2 and the column headed by current character no. 121 (a lower case “y” is number 121 in ASCII code, or a binary 01111001), the intersection of that particular row and that particular column would yield a next state as state 8. Adjacent state 8 as the next state in the table would be its corresponding special bit, which would be set, as indicated by a binary “1”. This is because state 8 is a special state.

[0054] In accordance with the method and hardware engine of the present invention, a lookup is now performed in the special state table 4. For state 8, the special state table 4 would include a terminal entry 8 as preferably a 16 bit word, since state 8 is a terminal state for Rule 2. For example, the terminal entry 8 would have a four bit opcode of 0010, or 0000, or any desired code indicating that state 8 is a terminal state. Following the opcode would be a “rule number” code, indicating the rule number for which state 8 is a terminal state. The rule number may be, for example, a six bit binary code which, in this case, could be the binary “000010”, which would correspond to and indicate Rule 2 as being the rule number for which state 8 is a terminal state. Following the rule number code in the terminal entry 8 is the start state register number code, which would indicate the start state register corresponding to Rule 2. This may also be a six bit code, for example, and in this particular example, the start state register number would be represented by the binary code “000010”.

[0055] The pattern matcher of the present invention now looks in the start state register 10 for Rule 2 to find the current offset stored therein. As stated before, the offset stored in the start state register for Rule 2 is the binary code “010”, which indicates that the start of the regular expression /.*xy/, that is, Rule 2, occurred at two characters from the beginning of the input character string. Accordingly, not only does the hardware engine know the location of the end-of-match for the second regular expression (Rule 2), because it occurred on the current input character, it also now knows the start-of-match location in the input character string for this particular pattern.
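The walkthrough of the input string “abxy” can be traced in software with a minimal sketch. Only the four transitions named in the description are included, because FIGS. 3 and 4 are not reproduced here; the dictionary layout, the tuple form of the entries and the function name `run` are illustrative assumptions rather than the hardware format.

```python
# Partial next state table: (current state, character) -> (next state, special bit).
# Only the transitions named in the description are present.
NEXT_STATE = {
    (0, "a"): (7, 1),
    (7, "b"): (4, 0),
    (4, "x"): (2, 1),
    (2, "y"): (8, 1),
}

# Special state table, keyed by state. Entries follow the description:
# ("start", rule) or ("terminal", rule, start state register number).
SPECIAL_STATE_TABLE = {
    7: [("start", 1)],
    5: [("start", 2)],
    2: [("start", 2)],
    1: [("terminal", 2, 2)],
    3: [("terminal", 1, 1)],
    8: [("terminal", 2, 2)],
}


def run(text, num_registers=12):
    """Feed text through the partial DFA; return (matches, start state registers)."""
    registers = [None] * num_registers
    matches = []  # (rule, start offset, end offset)
    state = 0
    for offset, ch in enumerate(text):
        if (state, ch) not in NEXT_STATE:
            break  # transition not given in the description; stop the sketch
        state, special = NEXT_STATE[(state, ch)]
        if not special:
            continue  # special bit clear: no special state table lookup
        for entry in SPECIAL_STATE_TABLE[state]:
            if entry[0] == "start":
                registers[entry[1]] = offset  # latch the current offset
            else:
                _, rule, register = entry
                matches.append((rule, registers[register], offset))
    return matches, registers


matches, registers = run("abxy")
print(matches)       # [(2, 2, 3)]
print(registers[1])  # 0
```

The single reported match says that Rule 2 started at offset 2 and ended at offset 3, while the register for Rule 1 still holds offset 0 from the transition into state 7.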

[0056] It should be realized that the number of bits described herein for each entry in the special state table 4, including the start entry 6, the opcode, the start state register select, the terminal entry 8, the rule number and the start state register number, as well as for the current offset information stored in the start state registers 10, is described for illustrative purposes only, and may be a lesser, or greater, number of bits. For a typical TCP/IP character string to be searched, there are usually at most approximately 1500 characters. This would mean that, if the hardware engine and methodology of the present invention are to be applied to search patterns in such a typical TCP/IP character string, then the start state registers should be capable of storing about 13 bits of offset information or more in each register.

[0057] It should be understood that a compiler 12 generates the DFA next state table 2, the special state table 4 and the entries therein for the hardware engine that supports the start-of-match methodology of the present invention, knowing the patterns desired to be matched. The compiler 12 finds the start states and the terminal states and loads the corresponding start entries 6 and terminal entries 8 into the special state table 4 accordingly. A regular expression to DFA compiler, formed in accordance with the present invention, will now be described.

[0058] In the following text, the method by which the compiler 12 determines which DFA states are start states is presented. First, a general outline of the conversion of regular expressions into DFAs is presented, and then the modifications to the process necessary for labeling states as start states, in accordance with the present invention, are presented.

[0059] The production of a final multi-rule DFA is performed in several stages. First, each rule has the metacharacters “.*” prepended to it and is transformed to an NFA using the well-known Thompson Construction. For a description of metacharacters and the Thompson Construction, reference should be made to Compilers, by A. V. Aho, R. Sethi, and J. D. Ullman, published by Addison-Wesley Publishing Company, 1986, the disclosure of which is incorporated herein by reference.

[0060] Second, each single rule NFA is converted into a DFA using the standard NFA to DFA algorithm, which is also commonly referred to as the subset construction algorithm. This algorithm creates a DFA state from one or more NFA states. For a more detailed explanation of this algorithm, reference should again be made to the aforementioned publication, Compilers.
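For illustration only, the subset construction may be sketched in Python as follows. This is a textbook-style sketch and not the compiler 12 itself; the NFA encoding (a dictionary mapping each state to a list of (label, target) pairs, with `None` standing for an epsilon transition), the function names, and the small example NFA for the expression a*b are all assumptions introduced for this example.

```python
from collections import deque

EPS = None  # label used for epsilon transitions (assumed encoding)


def epsilon_closure(nfa, states):
    """Return every NFA state reachable from `states` via epsilon moves only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for label, target in nfa.get(state, []):
            if label is EPS and target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def subset_construction(nfa, start, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    dfa_start = frozenset(epsilon_closure(nfa, {start}))
    dfa = {}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue  # already expanded
        dfa[current] = {}
        for ch in alphabet:
            moved = {t for s in current
                     for label, t in nfa.get(s, []) if label == ch}
            if moved:
                nxt = frozenset(epsilon_closure(nfa, moved))
                dfa[current][ch] = nxt
                queue.append(nxt)
    return dfa_start, dfa


# Hypothetical NFA for the expression a*b.
EXAMPLE_NFA = {
    0: [(EPS, 1), (EPS, 2)],
    1: [("a", 0)],
    2: [("b", 3)],
    3: [],
}

dfa_start, dfa = subset_construction(EXAMPLE_NFA, 0, "ab")
```

In this example the DFA has two states: the set {0, 1, 2}, which loops to itself on "a", and the set {3}, which is reached on "b".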

[0061] The third step in the production of the final multi-rule DFA is to create a new NFA start state, and to insert an epsilon transition from this new NFA start state to each of the DFA's for each rule. This third step is illustrated by FIG. 5. This step creates a new, single “meta-NFA”.
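A minimal Python sketch of this third step is given below, for illustration only. The data layout is an assumption: each per-rule automaton is given as a (start state, transitions) pair, states of the combined automaton are tagged with their rule index so that they cannot collide, and the names `build_meta_nfa` and `"meta_start"` are hypothetical.

```python
EPS = None  # label used for epsilon transitions (assumed encoding)


def build_meta_nfa(rule_automata):
    """Join per-rule automata under one new start state.

    rule_automata: list of (start_state, transitions) pairs, one per rule,
    where transitions maps a state to a list of (label, target) pairs.
    Returns (meta_start, meta_transitions). States of rule i are renamed
    to (i, state) so that different rules never share a state name.
    """
    meta_start = "meta_start"
    meta = {meta_start: []}
    for rule, (start, transitions) in enumerate(rule_automata):
        # Epsilon transition from the new start state into this rule.
        meta[meta_start].append((EPS, (rule, start)))
        for state, edges in transitions.items():
            meta[(rule, state)] = [(label, (rule, target))
                                   for label, target in edges]
    return meta_start, meta


# Two hypothetical single-rule automata: one matching "a", one matching "b".
rule_a = (0, {0: [("a", 1)], 1: []})
rule_b = (0, {0: [("b", 1)], 1: []})

meta_start, meta_nfa = build_meta_nfa([rule_a, rule_b])
```

Tagging states with the rule index also keeps track of which rule each terminal state belongs to when the meta-NFA is later converted to the final DFA.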

[0062] The fourth step in the process is to convert the meta-NFA to a DFA, again using the well-known subset construction algorithm.

[0063] The above-described procedure is modified in accordance with the present invention in the following manner for rules for which start-of-match data is requested. After an NFA is produced in the first step mentioned previously for each rule, it is analyzed for NFA start-of-match states. NFA start-of-match states are found as follows.

[0064] Starting at the initial state, an epsilon closure is generated. The 1-closure of that epsilon closure is then generated, and all states in the 1-closure but not in the initial epsilon closure are labeled as NFA start-of-match states.

[0065] In the second step of the production of the final multi-rule DFA mentioned previously, the NFA is converted to a DFA for each rule. Every DFA state that contains an NFA start-of-match state is a potential DFA start state for that particular rule. For all potential start states of a particular rule, the distance to the global start state (usually, the initial start state) is found. The closest potential start state to the global start state is chosen as a start state for that particular rule. If multiple potential start states are at the same distance from the global start state, they are all accepted as DFA start states. Finally, the chosen start states are carried through the third and fourth steps mentioned previously for producing the final multi-rule DFA.
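The selection just described may be sketched, for illustration only, as a breadth-first search over the per-rule DFA. Several points in the Python sketch below are assumptions rather than statements of the invention: "distance" is taken to mean the number of transitions on a shortest path from the global start state, DFA states are represented as frozensets of NFA state numbers, and the hand-written DFA for "adam" (with "x" standing for any other character) uses a hypothetical NFA numbering in which state 4 is the NFA start-of-match state.

```python
from collections import deque


def choose_dfa_start_states(dfa, global_start, nfa_start_of_match):
    """Return the DFA start states for one rule.

    Candidates are DFA states containing an NFA start-of-match state;
    the candidates closest to the global start state are chosen, and
    ties are all accepted.
    """
    # Breadth-first search gives the shortest distance to every DFA state.
    distance = {global_start: 0}
    queue = deque([global_start])
    while queue:
        state = queue.popleft()
        for nxt in dfa[state].values():
            if nxt not in distance:
                distance[nxt] = distance[state] + 1
                queue.append(nxt)
    candidates = {s: d for s, d in distance.items() if s & nfa_start_of_match}
    if not candidates:
        return set()
    closest = min(candidates.values())
    return {s for s, d in candidates.items() if d == closest}


# Hypothetical DFA for /.*adam/; each DFA state is a set of NFA states.
A = frozenset({0, 1, 2, 3})
B = frozenset({2, 3, 4})
C = frozenset({2, 3})
D = frozenset({2, 3, 5})
E = frozenset({2, 3, 4, 6})
F = frozenset({2, 3, 7})
ADAM_DFA = {
    A: {"a": B, "d": C, "m": C, "x": C},
    B: {"a": B, "d": D, "m": C, "x": C},
    C: {"a": B, "d": C, "m": C, "x": C},
    D: {"a": E, "d": C, "m": C, "x": C},
    E: {"a": B, "d": D, "m": F, "x": C},
    F: {"a": B, "d": C, "m": C, "x": C},
}

chosen = choose_dfa_start_states(ADAM_DFA, A, {4})
```

In this example states B and E both contain NFA state 4, at distances 1 and 3 respectively, so only B is chosen.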

[0066] The following is an example of how the modified regular expression to DFA compiler, formed in accordance with the present invention, operates. Assume that an input character stream is being searched for the unanchored regular expression “adam”. Only a single rule is used in this example to facilitate an understanding of the invention. The expression is unanchored in the sense that it can occur anywhere in the character stream.

[0067] In accordance with the first step of the present invention, the compiler 12 prepends the metacharacters “.*” to the rule so that the regular expression becomes /.*adam/. The regular expression is then transformed to an NFA using the Thompson Construction. The NFA that is produced is illustrated by FIG. 6.

[0068] In accordance with the present invention, the epsilon closure of the NFA initial state 0 is generated. This epsilon closure contains NFA states 1, 2 and 3, as shown in FIG. 6. The 1-closure of that epsilon closure is now generated, and all states in the 1-closure but not in the initial epsilon closure are labeled as NFA start-of-match states. As shown in FIG. 6, the 1-closure of the NFA states 1, 2 and 3 includes states 2, 3 and 4. Since state 4 is the only state in the 1-closure that is not in the epsilon closure, it is the only NFA start-of-match state.
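The closure computation of this example may be sketched in Python as follows, for illustration only. The NFA below is a hypothetical stand-in for FIG. 6: its state numbering and transitions were chosen so that the closures agree with the sets named in the text, and it is not asserted to be the figure itself. The sketch also assumes that the 1-closure of a set of states means the states reachable by consuming exactly one input character, followed by any epsilon transitions; the function names are likewise assumptions.

```python
EPS = None   # label used for epsilon transitions (assumed encoding)
ANY = "."    # label that matches any single character

# Hypothetical NFA for /.*adam/ (a stand-in for FIG. 6).
ADAM_NFA = {
    0: [(EPS, 1)],
    1: [(EPS, 2), (EPS, 3)],
    2: [(ANY, 2), (EPS, 3)],   # the ".*" loop
    3: [("a", 4)],
    4: [("d", 5)],
    5: [("a", 6)],
    6: [("m", 7)],
    7: [],                      # terminal state
}


def epsilon_closure(nfa, states):
    """Return every state reachable from `states` via epsilon moves only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for label, target in nfa[state]:
            if label is EPS and target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def one_closure(nfa, states):
    """States reachable by consuming one character, then epsilon moves."""
    moved = {target for state in states
             for label, target in nfa[state] if label is not EPS}
    return epsilon_closure(nfa, moved)


def start_of_match_states(nfa, initial=0):
    """States in the 1-closure that are not in the initial epsilon closure."""
    initial_closure = epsilon_closure(nfa, {initial})
    return one_closure(nfa, initial_closure) - initial_closure


initial_closure = epsilon_closure(ADAM_NFA, {0})
first_step = one_closure(ADAM_NFA, initial_closure)
som_states = start_of_match_states(ADAM_NFA)
```

With this stand-in NFA, the initial epsilon closure is states 0 through 3, the 1-closure is states 2, 3 and 4, and state 4 is the only NFA start-of-match state.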

[0069] This procedure is repeated for each rule, and the third and fourth steps in the production of the final multi-rule DFA, i.e., creating a “meta-NFA” and converting it to a DFA, respectively, are now performed.

[0070] Through the above-described procedure, the compiler 12 of the present invention has now generated the proper values to place in the DFA next state table 2 and the special state table 4 to permit the pattern matcher to determine the location of the start of a match in an input character string for each regular expression.
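For illustration only, the way a pattern matcher might use the generated tables can be sketched in software. The Python sketch below is a simplified, single-rule model and not the hardware engine: the table encodings, the helper `build_next_state_table`, the use of zero-based offsets, the placeholder `OTHER` for characters outside the pattern, and the choice of state 1 as the start state and state 4 as the terminal state for "adam" are all assumptions made for this example.

```python
PATTERN = "adam"
OTHER = "?"  # placeholder for any character not in the pattern (assumed)
ALPHABET = set(PATTERN) | {OTHER}

# Special state table: state 1 is the start state, state 4 the terminal state.
SPECIAL_STATE_TABLE = {
    1: {"op": "start", "register": 0},
    4: {"op": "terminal", "rule": 0, "register": 0},
}


def build_next_state_table(pattern, alphabet, special_states):
    """Map (state, char) to (next state, special-lookup indicator)."""
    table = {}
    for state in range(len(pattern) + 1):
        for ch in alphabet:
            seen = pattern[:state] + ch
            # Next state = length of the longest pattern prefix ending `seen`.
            k = min(len(pattern), len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            table[(state, ch)] = (k, k in special_states)
    return table


NEXT_STATE_TABLE = build_next_state_table(PATTERN, ALPHABET,
                                          SPECIAL_STATE_TABLE)


def scan(text):
    """Return (rule, start offset, end offset) for each match found."""
    state = 0
    registers = [None]  # one start state register
    matches = []
    for offset, ch in enumerate(text):
        key = ch if ch in PATTERN else OTHER
        state, special = NEXT_STATE_TABLE[(state, key)]
        if special:
            entry = SPECIAL_STATE_TABLE[state]
            if entry["op"] == "start":
                # Start state: record the current offset.
                registers[entry["register"]] = offset
            else:
                # Terminal state: report the rule and the stored start offset.
                matches.append((entry["rule"],
                                registers[entry["register"]], offset))
    return matches


result = scan("xxadamxx")
```

Here the indicator stored alongside each next state decides whether the special state table is consulted at all, so ordinary transitions need only one table lookup per character.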

[0071] FIG. 7 illustrates a system formed in accordance with the present invention used for determining the start of a match of a regular expression. The system preferably includes some or all of the components described previously, such as a compiler 12, a finite state automaton 14, for example, a deterministic (or a non-deterministic) finite state automaton, and an automaton memory 3, which are preferably operatively linked to, and communicate with, one another.

[0072] As is seen from the above description, the present invention provides a method and system for matching a pattern in a character string and determining the start of the match. The method and system advantageously find the start-of-match data for each rule of a multi-rule DFA in a single pass. What is more, this system and method will find a match after considering at most n characters, where n is the length of the input character string. Furthermore, if the DFA is implemented in hardware, the method and system of the present invention can perform the matches at gigabit and higher rates.

[0073] Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

What is claimed is:
 1. A system for determining the start of a match of a regular expression, comprising: a special state table which contains start state entries and terminal state entries; a plurality of start state registers for storing offset information indicative of the start of a match of the regular expression; a deterministic finite state automaton (DFA) next state table which, given the current state and an input character, returns the next state, the DFA next state table including a settable indicator for any next state table entry which indicates whether to perform a lookup into the special state table; and a compiler which loads values into the special state table based on the regular expression.
 2. A system for determining the start of one or more patterns of characters in an input character string, the patterns being defined by at least one character of the input character string, the input character string being provided to the system, the system operating in a series of states, the series of states including at least one start state and at least one terminal state, the system comprising: a finite state automaton, the finite state automaton being responsive to each character of the input character string and selectively transitioning to a next state in response to each character; an automaton memory having stored therein a state transition table and a special state table; the special state table including special state information; the special state information including start state entries and terminal state entries, the special state information having at least a first code to indicate whether the special state information is a start state entry or a terminal state entry, each start state entry including a start state register select code, each terminal state entry including a second code identifying the one or more particular patterns, and a start state register number code; and a plurality of start state registers, each register of the plurality of start state registers being identifiable by the start state register number code and having stored therein information relating to the location in the input character string of the start of a particular pattern of the one or more patterns; the state transition table including current state information corresponding to the current state of the finite state automaton, character information corresponding to the characters in the input character string, next state information relating to the next state to which the finite state automaton will transition in response to the current state information and the character information, and special state table information corresponding to the next state information and indicating whether the system should perform a lookup in the special state table.
 3. A system as defined by claim 2, which further comprises: a compiler cooperatively linked to the automaton memory, the compiler generating the special state information in the special state table and the current state information, character information, next state information and special state table information in the state transition table.
 4. A system as defined by claim 2, wherein the information stored in each register of the plurality of start state registers is offset information which corresponds to the position of a character in the input character string which resulted in the next state being a start state.
 5. A method of determining the start of a match of a regular expression using a system having a special state table, a plurality of start state registers and a deterministic finite state automaton next state table, the method comprising the steps of: determining, from the regular expression, each start state and each terminal state of a match of the regular expression; loading a start state entry into the special state table for each start state; loading a terminal state entry into the special state table for each terminal state; determining a next state from a current state and an input character from an input character string; loading a current offset from the beginning of the input character string into the start state register when a start state is encountered; and retrieving from the special state table the terminal state entry and retrieving the current offset from the start state register pertaining to the match of the regular expression when a terminal state is encountered.
 6. A method for determining the start of one or more patterns of characters in an input character string, the patterns being defined by at least one character of the input character string, the input character string being provided to a system having a finite state automaton, an automaton memory operatively linked to the finite state automaton, and a plurality of start state registers operatively linked to the automaton memory and finite state automaton, the system operating in a series of states, the series of states including at least one start state and at least one terminal state, the method comprising the steps of: providing each character of the input character string to the system such that the finite state automaton is responsive thereto and selectively transitions from a current state to a next state in response to each character; storing in the automaton memory a state transition table and a special state table, the special state table including special state information, the special state information including start state entries and terminal state entries, the special state information having at least a first code to indicate whether the special state information is a start state entry or a terminal state entry, the state transition table including current state information corresponding to the current state of the finite state automaton, character information corresponding to the characters in the input character string, next state information relating to the next state to which the finite state automaton will transition in response to the current state information and the character information, and special state table information corresponding to the next state information and indicating whether the system should perform a lookup in the special state table; storing in each register of the plurality of start state registers information relating to the location in the input character string of the start of a particular pattern of the one or more patterns; determining from the state transition table whether the next state is a special state in response to an input character of the input character string; performing a lookup in the special state table if the next state is determined to be a special state; reading special state information in the special state table in response to the lookup performed in the special state table; determining from the special state information whether the next state is at least one of a start state and a terminal state; loading current offset information into the start state register if the next state is a start state, the current offset information corresponding to the position of a character in the input character string which resulted in the next state being a start state; and retrieving from the special state table the special state information, and retrieving the current offset information from at least one register of the plurality of start state registers when the next state is determined to be a terminal state.
 7. A method for determining the start of one or more patterns of characters in an input character string, the patterns being defined by at least one character of the input character string, the input character string being provided to a system having a finite state automaton, an automaton memory operatively linked to the finite state automaton, and a plurality of start state registers operatively linked to the automaton memory and finite state automaton, the system operating in a series of states, the series of states including at least one start state and at least one terminal state, the method comprising the steps of: providing each character of the input character string to the system such that the finite state automaton is responsive thereto and selectively transitions from a current state to a next state in response to each character; storing in the automaton memory a state transition table and a special state table, the special state table including special state information, the special state information including start state entries and terminal state entries, the special state information having at least a first code to indicate whether the special state information is a start state entry or a terminal state entry, each start state entry including a start state register select code, each terminal state entry including a second code identifying the one or more patterns, and a start state register number code, the state transition table including current state information corresponding to the current state of the finite state automaton, character information corresponding to the characters in the input character string, next state information relating to the next state to which the finite state automaton will transition in response to the current state information and the character information, and special state table information corresponding to the next state information and indicating whether the system should perform a lookup in the special state table; storing in each register of the plurality of start state registers information relating to the location in the input character string of the start of a particular pattern of the one or more patterns; determining from the state transition table whether the next state is a special state in response to an input character of the input character string; performing a lookup in the special state table if the next state is determined to be a special state; reading at least one of the start state entries and the terminal state entries in response to the lookup performed in the special state table; determining from the at least one of the start state entries and the terminal state entries whether the next state is at least one of a start state and a terminal state; loading current offset information into the start state register if the next state is a start state, the current offset information corresponding to the position of a character in the input character string which resulted in the next state being a start state; and retrieving from the special state table the terminal state entry, and retrieving the current offset information from at least one register of the plurality of start state registers when the next state is determined to be a terminal state.
 8. A method for determining the start states of each rule of a plurality of rules and generating a multi-rule deterministic finite state automaton (DFA), which comprises the steps of: prepending to each rule the metacharacters “.*” and transforming each rule prepended with the metacharacters to a non-deterministic finite state automaton (NFA) using a Thompson Construction; analyzing the NFA for each rule to determine NFA start states by the following substeps: a) producing an epsilon closure starting at the initial state of the NFA; b) producing a 1-closure of the initial epsilon closure; c) comparing the states in the initial epsilon closure with the states in the 1-closure; and d) determining as NFA start states all states in the 1-closure which are not in the initial epsilon closure; converting the NFA for each rule into a DFA using an NFA to DFA algorithm, thereby creating a DFA state from one or more NFA states; determining for each DFA state whether it contains an NFA start state; for each DFA state that contains an NFA start state, determining the distance of the DFA state from the global start of the DFA for each rule; comparing the distances of the DFA start states that contain an NFA start state from the global start state; choosing as a DFA start state the DFA state containing an NFA start state which is closest to the global start state; if more than one DFA state containing an NFA start state have the same closest distance to the global start state, accepting as DFA start states each of said closest DFA states having the same closest distance to the global start state; creating a new NFA start state and inserting an epsilon transition from the new NFA start state to each of the DFA's for each rule, thereby creating a meta-NFA; and converting the meta-NFA to a final multi-rule DFA.